Abstract

Pinsker's widely used inequality upper-bounds the total variation distance $\|Q - P\|_1$
in terms of the Kullback-Leibler divergence $D(Q\|P)$. Although in general a bound in
the reverse direction is impossible, in many applications the quantity of interest is
actually $D^*(P,\varepsilon)$, defined, for an arbitrary fixed $P$, as the infimum of $D(Q\|P)$ over all
distributions $Q$ that are $\varepsilon$-far away from $P$ in total variation. We show that
$D^*(P,\varepsilon) \le C\varepsilon^2 + O(\varepsilon^3)$, where $C = C(P) = 1/2$ for "balanced" distributions, thereby providing a
kind of reverse Pinsker inequality. Some of the structural results obtained in the course
of the proof may be of independent interest. An application to large deviations is given.



1 Introduction 



1.1 Background

Since its publication in 1964, Pinsker's inequality [23] has become a ubiquitous tool in
probability [2, 20], statistics [19], information theory [I], and, more recently, in machine
learning. Letting

$$D(Q\|P) = \int \log\left(\frac{dQ}{dP}\right) dQ$$

denote the Kullback-Leibler divergence between the distributions $P$ and $Q$, and letting
$V(Q,P) = \|Q - P\|_1$ be their total variation distance, Pinsker's inequality states that

$$D(Q\|P) \ge V^2(Q,P)/2. \tag{1}$$

(Pinsker's original result had a worse constant; the bound was gradually improved by a number
of authors, including [28]; see [25] for a detailed history and the "best possible
Pinsker inequality". Recent extensions to general $f$-divergences may be found in [14] and
[25].)

While some upper bounds on the KL-divergence in terms of other $f$-divergences are known
(see, e.g., [12]), in general it is impossible to upper-bound $D(Q\|P)$ in terms of $V(Q,P)$, since
for every $v \in (0,2]$ there is a pair of distributions $P, Q$ with $V(Q,P) = v$ and $D(Q\|P) = \infty$.

However, in many applications, the actual quantity of interest is not $D(Q\|P)$ for arbitrary
$P$ and $Q$, but rather

$$D^*(P,\varepsilon) = \inf_{Q : V(Q,P) \ge \varepsilon} D(Q\|P). \tag{2}$$

For example, Sanov's Theorem [6, 9] (about which we will have more to say below) implies
that the probability that the empirical distribution $P_n$, based on a sample of size $n$, deviates
in $\ell_1$ by $\varepsilon$ or more from the true distribution $P$ behaves asymptotically as $\exp(-nD^*(P,\varepsilon))$.

In this paper, we show that for the broad class of "balanced" distributions,
$D^*(P,\varepsilon) \le \varepsilon^2/2 + O(\varepsilon^4)$, which matches the form of the bound in (1). For distributions not belonging
to this class, we show that $D^*(P,\varepsilon) = \varepsilon^2/(8\beta(1-\beta)) - O(\varepsilon^3)$, where $\beta$ is a measure of the
"imbalance" of $P$ defined below; this may also be interpreted as a reverse Pinsker inequality.



1.2 Related work 

In [23], Ordentlich and Weinberger considered the following distribution-dependent refinement
of Pinsker's inequality. For a distribution $P$ with balance coefficient $\beta$, define $\varphi(P)$ by

$$\varphi(P) = \frac{1}{2\beta - 1}\log\frac{\beta}{1-\beta}$$

(for $\beta = 1/2$, $\varphi(P) = 2$). It is shown in [23] that

$$D(Q\|P) \ge \frac{\varphi(P)}{4}\, V(Q,P)^2 \tag{3}$$

for all $Q, P$, and furthermore, that $\varphi(P)/4$ is the best $P$-dependent coefficient possible:

$$\inf_Q \frac{D(Q\|P)}{V(Q,P)^2} = \frac{\varphi(P)}{4}. \tag{4}$$

Although the left-hand sides of (4) and (2) bear a superficial resemblance, the two quantities
are quite different (in particular, the latter is constrained by $V(Q,P) \ge \varepsilon$). While
distribution-independent versions of (3) exist (viz., Pinsker's original inequality), our main result
(Theorem 1) does not admit a distribution-independent form. Simply put, the result in [23] yields a
lower bound on $D^*(P,\varepsilon)$, while we seek to upper-bound this quantity, and actually compute
it exactly for unbalanced distributions. We remark that though some of our proof techniques
are similar to those in [23], we opted for a self-contained treatment.



2 Main results 

Throughout this paper, we have in mind a (finite or $\sigma$-finite) measure space $(\Omega, \mathcal{F}, \mu)$, and all
the distributions in question will be defined on this space and assumed absolutely continuous
with respect to $\mu$; this collection of distributions will be denoted by $\mathcal{P}$. The balance coefficient
of a distribution $P$ is

$$\beta = \inf\{P(F) : F \in \mathcal{F},\ P(F) \ge 1/2\}.$$

A distribution is balanced if $\beta = 1/2$ and imbalanced otherwise. Observe that all non-atomic
distributions on $\mathbb{R}$ are balanced. A stronger condition on $P$ is that of having full range. Here,
the range of a distribution is

$$\mathcal{R}(P) = \{P(F) : F \in \mathcal{F}\}.$$






A distribution $P$ has full range if $\mathcal{R}(P) = [0,1]$. Non-atomic distributions on $\mathbb{R}$ have full range,
as do, e.g., geometric distributions with parameter $p > 1/2$. Distributions $P = (p_1, p_2, \ldots)$
over $\mathbb{N}$, satisfying $p_{n+1} > p_n/2$ for all sufficiently large $n$, admit an effective calculation of the
range as a finite union of segments.
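For a finitely supported distribution, both the range and the balance coefficient can be found by enumerating subsets. The following is a minimal sketch under that assumption; the function names `subset_masses` and `balance_coefficient` and the example vectors are hypothetical and serve only as illustration.

```python
from itertools import combinations

def subset_masses(p):
    """All values P(F), F a subset of the support, rounded to suppress float noise."""
    masses = set()
    for r in range(len(p) + 1):
        for idx in combinations(range(len(p)), r):
            masses.add(round(sum(p[i] for i in idx), 12))
    return sorted(masses)

def balance_coefficient(p):
    """beta = min{P(F) : P(F) >= 1/2}; the infimum is attained for finite support."""
    return min(m for m in subset_masses(p) if m >= 0.5)

p = [0.6, 0.25, 0.15]                    # hypothetical imbalanced example
print(subset_masses(p))                  # the (finite) range R(P)
print(balance_coefficient(p))            # 0.6, so P is imbalanced
print(balance_coefficient([0.25] * 4))   # 0.5, so the uniform law on 4 points is balanced
```

The enumeration is exponential in the support size, so it is only meant for small examples.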

It will be useful to define the function $\mathrm{KL}_2 : (0,1)^2 \to [0,\infty)$ by

$$\mathrm{KL}_2(p,q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

and the so-called Vajda's tight lower bound $L$ [28]:

$$L(\varepsilon) = \inf_{P,Q : V(Q,P) = \varepsilon} D(Q\|P).$$

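An exact parametric equation of the curve $(\varepsilon, L(\varepsilon))$ in $\mathbb{R}^2$, with parameter $0 < t < \infty$, is given in [13]:

$$x(t) = t\left(1 - \left(\coth t - \frac{1}{t}\right)^2\right),$$

$$y(t) = \log\frac{t}{\sinh t} + t\coth t - \frac{t^2}{\sinh^2 t}.$$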
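The parametrization can be evaluated numerically. The sketch below (the function name `vajda_curve` is hypothetical) computes points $(x(t), y(t))$ and compares $y$ with the truncated series $x^2/2 + x^4/36$ and with the Pinsker lower bound $x^2/2$; natural logarithms are assumed.

```python
import math

def vajda_curve(t):
    """Point (eps, L(eps)) on Vajda's curve for a parameter t > 0."""
    coth = math.cosh(t) / math.sinh(t)
    x = t * (1.0 - (coth - 1.0 / t) ** 2)
    y = math.log(t / math.sinh(t)) + t * coth - (t / math.sinh(t)) ** 2
    return x, y

for t in (0.1, 0.3, 0.5):
    eps, L = vajda_curve(t)
    series = eps**2 / 2 + eps**4 / 36      # two leading terms of the expansion
    print(f"t={t}: eps={eps:.6f}  L={L:.8f}  series={series:.8f}  pinsker={eps**2 / 2:.8f}")
```

For small $t$ the three printed values should be close, with $L$ slightly above the Pinsker bound.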

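We can now state our reverse Pinsker inequality:

Theorem 1. Suppose $P \in \mathcal{P}$ has balance coefficient $\beta$. Then:

(a) For $\beta > 1/2$ and $0 < \varepsilon < 4\beta - 2$,

$$D^*(P,\varepsilon) = \mathrm{KL}_2(\beta - \varepsilon/2, \beta).$$

(b) For $\beta = 1/2$ and $0 < \varepsilon < 1$,

$$L(\varepsilon) \le D^*(P,\varepsilon) \le \mathrm{KL}_2(1/2 - \varepsilon/2, 1/2).$$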

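As a comparison of orders of magnitude, note that

$$\mathrm{KL}_2(\beta - \varepsilon/2, \beta) = \frac{\varepsilon^2}{8\beta(1-\beta)} - \frac{(2\beta - 1)\,\varepsilon^3}{48\beta^2(1-\beta)^2} + O(\varepsilon^4),$$

$$\mathrm{KL}_2(1/2 - \varepsilon/2, 1/2) = \varepsilon^2/2 + \varepsilon^4/12 + O(\varepsilon^6),$$

$$L(\varepsilon) = \varepsilon^2/2 + \varepsilon^4/36 + O(\varepsilon^6),$$

where the last expansion is well-known [13].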
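The first two expansions are easy to check numerically. This is a small sketch assuming natural logarithms; `kl2` and `kl2_series` are hypothetical helper names, and the values of $\beta$ and $\varepsilon$ are arbitrary.

```python
import math

def kl2(p, q):
    """Binary KL divergence KL_2(p, q) in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl2_series(beta, eps):
    """Expansion of KL_2(beta - eps/2, beta) up to and including the cubic term."""
    s = beta * (1 - beta)
    return eps**2 / (8 * s) - (2 * beta - 1) * eps**3 / (48 * s**2)

beta, eps = 0.7, 0.05
print(kl2(beta - eps / 2, beta), kl2_series(beta, eps))    # imbalanced case

eps = 0.1
print(kl2(0.5 - eps / 2, 0.5), eps**2 / 2 + eps**4 / 12)   # balanced case
```

In each printed pair the two numbers should agree up to the order of the neglected term.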

Despite the deceptive similarity of parts (a) and (b) in Theorem 1, (b) is not a special
or limiting case of (a). The point (as will be clear from the proofs) is that in case (a), for
all sufficiently small $\varepsilon > 0$, the minimum on the right-hand side of (2) is obtained by the
same distribution $Q$. In case (b), the optimal $Q$ in (2) may vary with $\varepsilon$, at least in principle.
In particular, when $P$ has full range, $Q$ can vary in an essentially unconstrained fashion.
Moreover, in case (a) we have equality, while in case (b) we have only an upper bound.

Theorem 2. If $P \in \mathcal{P}$ has full range, then for $0 < \varepsilon < 2$,

$$D^*(P,\varepsilon) = L(\varepsilon).$$



3 Proofs



We will consistently use upper-case letters for distributions $P \in \mathcal{P}$ and corresponding
lower-case letters for their densities $p$ with respect to $\mu$. (For discrete distributions, we will often
blur the distinction between $P$ and $p$.)

Our first lemma provides a structural result for extremal distributions. Suppose a
distribution $P \in \mathcal{P}$ is given, along with an $A \in \mathcal{F}$ and an $0 < \varepsilon \le 2(1 - P(A))$. Denote by
$\mathcal{Q}(P,A,\varepsilon)$ the collection of all distributions $Q \in \mathcal{P}$ for which $V(Q,P) = \varepsilon$ and

$$\{\omega \in \Omega : p(\omega) < q(\omega)\} \subseteq A.$$

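Lemma 3. For all $P \in \mathcal{P}$, $A \in \mathcal{F}$ with $0 < P(A) < 1$, and $\varepsilon \in (0, 2(1 - P(A))]$, let $Q^* \in \mathcal{P}$
have the density

$$q^* = (u\mathbf{1}_A + v\mathbf{1}_{\Omega \setminus A})\,p, \tag{5}$$

where

$$u = 1 + \frac{\varepsilon}{2P(A)}, \qquad v = 1 - \frac{\varepsilon}{2(1 - P(A))}.$$

Then:

(a) $Q^*$ belongs to $\mathcal{Q}(P,A,\varepsilon)$.

(b) $Q^*$ is the unique minimizer of $D(Q\|P)$ over $Q \in \mathcal{Q}(P,A,\varepsilon)$.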
Proof. (a) Trivial.

(b) Consider first the case of finite $\Omega = \{1, \ldots, k\}$, where $\mu$ is the counting measure. By
a standard compactness argument, some $Q = (q_1, \ldots, q_k) \in \mathcal{Q}(P,A,\varepsilon)$
achieves the minimal value of $D(Q\|P)$. We claim that $q_i > p_i$ for all $i \in A$, that
is, the minimum cannot be attained on the boundary of $\mathcal{Q}(P,A,\varepsilon)$. Suppose to the
contrary that $q_j = p_j$ for some $j \in A$. Choose some $i \in A \setminus \{j\}$ for which $q_i > p_i$ and
define $Q_\delta$ to be identical to $Q$ with two exceptions: $q_{\delta,j} = p_j + \delta$ and $q_{\delta,i} = q_i - \delta$, for
any sufficiently small $\delta > 0$. Note that $V(Q_\delta, P) = \varepsilon$ and define $F(\delta) = D(Q_\delta\|P)$.
Differentiating, we get

$$F'(\delta) = \log\frac{p_i}{q_i - \delta} - \log\frac{p_j}{p_j + \delta}.$$

Since $F'(0) = \log(p_i/q_i) < 0$, there exists a $\delta > 0$ for which $D(Q_\delta\|P) < D(Q\|P)$, a
contradiction.

Since $Q$ cannot lie on the boundary of $\mathcal{Q}(P,A,\varepsilon)$, the gradient of $G(Q) = D(Q\|P)$
must vanish at $Q$. In particular, assume without loss of generality that $\{1,2\} \subseteq A$ and
consider the distribution $Q_z = (z, c - z, q_3, q_4, \ldots, q_k)$, where $c = q_1 + q_2$ and $z \in [0,c]$.
Taking $H(z) = D(Q_z\|P)$, we have that $H'(q_1) = 0$. Therefore,

$$H'(q_1) = \log\frac{q_1}{p_1} - \log\frac{q_2}{p_2} = 0.$$

Applying this argument to every distinct $i,j \in A$, as well as every distinct $i,j \notin A$, we
conclude that

$$q_i = \begin{cases} u p_i, & i \in A, \\ v p_i, & i \notin A, \end{cases}$$

for some constants $u, v > 0$. To compute $u$ and $v$, observe that they satisfy the two
equations

$$1 = uP(A) + v(1 - P(A)),$$

$$\varepsilon = (u - 1)P(A) + (1 - v)(1 - P(A)),$$

whose solution yields (5). This proves the claim for finite $\Omega$.

The general (infinite $\Omega$) case can be obtained from the finite case via the data processing
inequality [29, Theorem 9] and [29, Theorem 10], which show that $D(Q\|P)$ may be
approximated arbitrarily well by the divergence between the restrictions of $Q$ and $P$ to $\mathcal{B}$,
where $\mathcal{B}$ is an appropriate finite refinement of $\{A, \Omega \setminus A\}$. Thus, we rule out the
possibility of a $Q' \in \mathcal{Q}(P,A,\varepsilon)$ for which $D(Q'\|P) < D(Q^*\|P)$. The uniqueness of the
minimizer $Q^*$ follows from the strict convexity of $D$ [29, Theorem 11].

□
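Lemma 3 can be illustrated on a small discrete example. The sketch below builds $q^*$ from (5) and compares its divergence with that of nearby members of $\mathcal{Q}(P,A,\varepsilon)$ obtained by shifting mass inside $A$. The example distribution, the set $A$, and all function names are hypothetical; this is a sanity check on one instance, not a proof.

```python
import math
import random

def kl(q, p):
    """D(Q||P) for finitely supported distributions, with 0 log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def tv(q, p):
    """V(Q, P) as the L1 distance, so that V takes values in [0, 2]."""
    return sum(abs(qi - pi) for qi, pi in zip(q, p))

def lemma3_minimizer(p, A, eps):
    """Density q* = (u 1_A + v 1_{A^c}) p from (5)."""
    pA = sum(p[i] for i in A)
    u = 1 + eps / (2 * pA)
    v = 1 - eps / (2 * (1 - pA))
    return [(u if i in A else v) * pi for i, pi in enumerate(p)]

p = [0.4, 0.3, 0.2, 0.1]
A = {0, 1}
eps = 0.2
q_star = lemma3_minimizer(p, A, eps)

# Shift a little mass between the two points of A. Since q_i > p_i is preserved
# on A, the perturbed Q stays in Q(P, A, eps) and V(Q, P) remains equal to eps.
random.seed(0)
worst_gap = float("inf")
for _ in range(1000):
    d = random.uniform(-0.02, 0.02)
    q = list(q_star)
    q[0] += d
    q[1] -= d
    worst_gap = min(worst_gap, kl(q, p) - kl(q_star, p))

print(tv(q_star, p), kl(q_star, p), worst_gap)
```

The smallest observed gap should be non-negative up to rounding, consistent with $q^*$ being the minimizer.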

Our next result is that $D^*$ actually has a somewhat simpler form than in (2), and the
infimum is always attained.

Lemma 4. For all distributions $P$ and all $\varepsilon > 0$,

$$D^*(P,\varepsilon) = \min_{Q : V(Q,P) = \varepsilon} D(Q\|P).$$

Proof. We first consider the case of finite $\Omega = \{1, \ldots, k\}$. Let $P$ and $\varepsilon$ be given. By standard
compactness and continuity arguments there is a $Q^* \in \mathcal{P}$ with $V(Q^*,P) \ge \varepsilon$ and
$D(Q^*\|P) = D^*(P,\varepsilon)$.

Now suppose that $Q \in \mathcal{P}$ is such that $V(Q,P) = \varepsilon' > \varepsilon$. Consider the distribution
$Q_\delta = Q + \delta(P - Q)$ for $\delta \in [0,1]$. Observe that $V(Q_\delta, P) = (1 - \delta)V(Q,P) \le \varepsilon'$ and define

$$F(\delta) = D(Q_\delta\|P).$$

We would like to show that, for $\delta$ sufficiently small, $F(\delta) < F(0) = D(Q\|P)$. To this end, it
suffices to show that $F'(0) < 0$. Indeed,

$$F'(\delta) = \sum_{i=1}^k (p_i - q_i)\left(1 + \log\frac{q_i + \delta(p_i - q_i)}{p_i}\right),$$

$$F'(0) = \sum_{i=1}^k \left(p_i - q_i - p_i\log\frac{p_i}{q_i} - q_i\log\frac{q_i}{p_i}\right) = -D(P\|Q) - D(Q\|P) < 0.$$

Hence, $Q^*$ cannot satisfy $V(Q^*,P) > \varepsilon$; this proves the claim for finite $\Omega$.
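The identity $F'(0) = -D(P\|Q) - D(Q\|P)$ can be checked by a finite difference on a toy example. The distributions below are arbitrary illustrative choices and the step size `h` is an assumption of this sketch.

```python
import math

def kl(q, p):
    """D(Q||P) for finitely supported distributions, with 0 log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

def F(delta):
    """F(delta) = D(Q_delta || P) along the segment Q_delta = Q + delta (P - Q)."""
    q_delta = [qi + delta * (pi - qi) for qi, pi in zip(q, p)]
    return kl(q_delta, p)

h = 1e-6
numeric = (F(h) - F(0.0)) / h          # forward-difference estimate of F'(0)
closed_form = -kl(p, q) - kl(q, p)     # value claimed in the proof
print(numeric, closed_form)
```

Both numbers are negative and agree up to the discretization error of the forward difference.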
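We extend the proof to general $\Omega$ as follows. First, we claim that it suffices to consider
$\inf_{Q \in \mathcal{P} : V(Q,P) = \varepsilon} D(Q\|P)$. Indeed, for any $Q \in \mathcal{P}$ with $V(Q,P) = \varepsilon' > \varepsilon$ we can define $Q_\delta$
and $F(\delta)$ exactly as in the finite case, and an analogous argument shows that $F'(0) < 0$
(differentiation under the integral is justified by Lebesgue's dominated convergence theorem).
Hence,

$$\inf_{Q \in \mathcal{Q}_{\ge}(P,A,\varepsilon)} D(Q\|P) = \inf_{Q \in \mathcal{Q}(P,A,\varepsilon)} D(Q\|P),$$

where $\mathcal{Q}_{\ge}(P,A,\varepsilon)$ is defined analogously to $\mathcal{Q}(P,A,\varepsilon)$, with the constraint $V(Q,P) = \varepsilon$
replaced by $V(Q,P) \ge \varepsilon$. The optimization problem over distributions $Q \in \mathcal{P}$ has now been
reduced to one over sets $A \in \mathcal{F}$:

$$D^*(P,\varepsilon) = \inf_{A \in \mathcal{F}_\varepsilon}\ \inf_{Q \in \mathcal{Q}(P,A,\varepsilon)} D(Q\|P),$$

where $\mathcal{F}_\varepsilon = \{A \in \mathcal{F} : P(A) \le 1 - \varepsilon/2\}$. Lemma 3 shows that the inner infimum is achieved
by a unique $Q_A \in \mathcal{Q}(P,A,\varepsilon)$, and thus,

$$D^*(P,\varepsilon) = \inf_{A \in \mathcal{F}_\varepsilon} D(Q_A\|P).$$

Define the distance $\Delta$ over $\mathcal{F}_\varepsilon$ by

$$\Delta(A,B) = P((A \setminus B) \cup (B \setminus A));$$

modulo sets of $P$-measure zero, $(\mathcal{F}_\varepsilon, \Delta)$ is a bounded complete metric space. For $\delta > 0$, put
$\mathcal{F}_{\varepsilon,\delta} = \{A \in \mathcal{F}_\varepsilon : D(Q_A\|P) \le D^*(P,\varepsilon) + \delta\}$. Each $\mathcal{F}_{\varepsilon,\delta}$ is easily seen to be nonempty and
closed; additionally, these sets form a nested family: $\mathcal{F}_{\varepsilon,\delta} \supseteq \mathcal{F}_{\varepsilon,\delta'}$ for $\delta > \delta'$. Thus, the
intersection

$$\bigcap_{\delta > 0} \mathcal{F}_{\varepsilon,\delta}$$

is also non-empty, and every $A$ in this intersection satisfies $D^*(P,\varepsilon) = D(Q_A\|P)$. □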

To state our next lemma, consider some finite measurable partition $\mathcal{A} = \{A_1, \ldots, A_k\}$
and define the projection operator $\pi_{\mathcal{A}} : \Omega \to \mathcal{A}$, which induces the map $P \mapsto \pi_{\mathcal{A}}(P)$ taking
a distribution $P$ to its "collapsed" form $(P(A_1), \ldots, P(A_k))$. We claim that a collapse can
never decrease the value of $D^*$:

Lemma 5. Let $\mathcal{A} = \{A_1, \ldots, A_k\}$, $k \ge 2$, be a partition of $\Omega$. Then, for all distributions $P$,

$$D^*(\pi_{\mathcal{A}}(P), \varepsilon) \ge D^*(P,\varepsilon), \qquad \varepsilon > 0.$$

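Proof. We will omit the subscript $\mathcal{A}$ for readability. Let $\mathcal{Q}_\varepsilon(P) = \{Q \in \mathcal{P} : V(Q,P) = \varepsilon\}$
and similarly, let $\mathcal{Q}'_\varepsilon(P)$ be the collection of distributions on a set of $k$ objects that are $\varepsilon$-far
from $P' = \pi(P) = (p'_{A_1}, \ldots, p'_{A_k})$. For given $P \in \mathcal{P}$ and $Q' = (q'_{A_1}, \ldots, q'_{A_k}) \in \mathcal{Q}'_\varepsilon(P)$, define
$Q \in \mathcal{P}$ by its density $q$:

$$q(\omega) = \frac{q'_{A_i}}{p'_{A_i}}\,p(\omega), \qquad \omega \in \Omega,\ A_i = \pi(\omega);$$

it is trivially verified that $Q$ is a valid distribution. Then

$$V(Q,P) = \int_\Omega |q - p|\,d\mu = \sum_{i=1}^k \left|\frac{q'_{A_i}}{p'_{A_i}} - 1\right| \int_{A_i} p\,d\mu = \sum_{i=1}^k |q'_{A_i} - p'_{A_i}| = V(Q',P'),$$

which shows that $Q \in \mathcal{Q}_\varepsilon(P)$, and

$$D(Q\|P) = \int_\Omega q(\omega)\log\frac{q(\omega)}{p(\omega)}\,d\mu(\omega) = \sum_{i=1}^k q'_{A_i}\log\frac{q'_{A_i}}{p'_{A_i}} = D(Q'\|P').$$

Thus, for each $P \in \mathcal{P}$ and $Q' \in \mathcal{Q}'_\varepsilon(P)$, there is a $Q \in \mathcal{Q}_\varepsilon(P)$ such that
$D(Q\|P) = D(Q'\|\pi(P))$; in particular, this is true for the minimizer $Q'^*$ of $D(Q'\|\pi(P))$ over $\mathcal{Q}'_\varepsilon(P)$. □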
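The lifting construction in the proof of Lemma 5 can be reproduced on a finite example. In the sketch below, the partition, the coarse distribution `q_coarse`, and the function name `lift` are hypothetical illustrative choices.

```python
import math

def kl(q, p):
    """D(Q||P) for finitely supported distributions, with 0 log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def tv(q, p):
    """V(Q, P) as the L1 distance."""
    return sum(abs(qi - pi) for qi, pi in zip(q, p))

def lift(p, blocks, q_coarse):
    """Spread q'_{A_i} over the block A_i proportionally to p."""
    q = [0.0] * len(p)
    for block, q_block in zip(blocks, q_coarse):
        p_block = sum(p[i] for i in block)
        for i in block:
            q[i] = p[i] * q_block / p_block
    return q

p = [0.4, 0.3, 0.2, 0.1]
blocks = [[0, 1], [2, 3]]                              # partition {A_1, A_2}
p_coarse = [sum(p[i] for i in b) for b in blocks]      # collapsed P' = pi(P)
q_coarse = [0.6, 0.4]                                  # some Q' on two points
q = lift(p, blocks, q_coarse)

print(tv(q, p), tv(q_coarse, p_coarse))   # equal total variation distances
print(kl(q, p), kl(q_coarse, p_coarse))   # equal divergences
```

The lifted $Q$ has the same distance and the same divergence as $Q'$, which is exactly what the lemma uses.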

Our next lemma will allow us to restrict our attention to distributions with binary support.
Now it is well known [13] that for each pair of distributions $Q, P$, there is a pair of binary
distributions $Q', P'$ such that $V(Q',P') = V(Q,P)$ and $D(Q'\|P') = D(Q\|P)$ (this fact is
generalized to general $f$-divergences in [15]). However, in our case $P$ is fixed whereas only $Q$
is allowed to vary, and so this result is not directly applicable. Still, the intuition turns out
to be correct:

Lemma 6. Let $P$ be a distribution on $\Omega$, whose support contains at least two points, and
$\varepsilon > 0$. Then there is a measurable partition $\mathcal{A} = \{A, B\}$ such that

$$D^*(P,\varepsilon) = D^*(\pi_{\mathcal{A}}(P), \varepsilon).$$

Proof. By Lemma 4 there is a $Q^* \in \mathcal{P}$ with $V(Q^*,P) = \varepsilon$ and $D(Q^*\|P) = D^*(P,\varepsilon)$. Define
$A, B \in \mathcal{F}$ by $A = \{\omega \in \Omega : p(\omega) < q^*(\omega)\}$ and $B = \Omega \setminus A$ and put $\mathcal{A} = \{A, B\}$. Then

$$D^*(\pi_{\mathcal{A}}(P), \varepsilon) \ge D^*(P,\varepsilon)$$

by Lemma 5 and

$$D(\pi_{\mathcal{A}}(Q^*)\|\pi_{\mathcal{A}}(P)) \le D(Q^*\|P) = D^*(P,\varepsilon)$$

by the data processing inequality [29, Theorem 9]. Trivially, $V(\pi_{\mathcal{A}}(Q^*), \pi_{\mathcal{A}}(P)) = V(Q^*,P) = \varepsilon$,
and thus $\mathcal{A}$ has the properties claimed.

□

We now turn to studying binary distributions $P$ and the corresponding $Q^*$ satisfying
$D(Q^*\|P) = D^*(P,\varepsilon)$.

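Lemma 7. Let $P = (p_0, 1 - p_0)$ be a binary distribution with $p_0 > 1/2$ and $\varepsilon \in (0, 2p_0]$. Then
the unique $Q^*$ satisfying $V(Q^*,P) = \varepsilon$ and $D(Q^*\|P) = D^*(P,\varepsilon)$ is

$$Q^* = \left(p_0 - \frac{\varepsilon}{2},\ 1 - p_0 + \frac{\varepsilon}{2}\right).$$

Proof. By Lemma 3 there are at most two possibilities for $Q^*$, namely
$Q^* = Q_1 = (p_0 - \frac{\varepsilon}{2}, 1 - p_0 + \frac{\varepsilon}{2})$ and $Q^* = Q_2 = (p_0 + \frac{\varepsilon}{2}, 1 - p_0 - \frac{\varepsilon}{2})$.
(Actually, if $\varepsilon > 2(1 - p_0)$ then only $Q_1$ is a valid distribution.) To prove that $Q^* = Q_1$,
it suffices to show that

$$F(\varepsilon) = \mathrm{KL}_2(p_0 - \varepsilon/2, p_0) - \mathrm{KL}_2(p_0 + \varepsilon/2, p_0) < 0$$

for all $\varepsilon \in (0, 2(1 - p_0)]$. To this end, we expand

$$F(\varepsilon) = \sum_{n=1}^{\infty} \frac{\varepsilon^{2n+1}}{2^{2n+1}(2n+1)\,n}\left(\frac{1}{p_0^{2n}} - \frac{1}{(1 - p_0)^{2n}}\right),$$

which is negative since $p_0 > 1 - p_0$. □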
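The series for $F(\varepsilon)$ can be compared with the direct difference of the two binary divergences. The following sketch assumes natural logarithms; the truncation at 60 terms and the names `F_direct` and `F_series` are choices made for this illustration.

```python
import math

def kl2(p, q):
    """Binary KL divergence KL_2(p, q) in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def F_direct(p0, eps):
    """F(eps) = KL_2(p0 - eps/2, p0) - KL_2(p0 + eps/2, p0)."""
    return kl2(p0 - eps / 2, p0) - kl2(p0 + eps / 2, p0)

def F_series(p0, eps, terms=60):
    """Truncated power series for F(eps) from the proof of Lemma 7."""
    total = 0.0
    for n in range(1, terms + 1):
        coeff = eps ** (2 * n + 1) / (2 ** (2 * n + 1) * (2 * n + 1) * n)
        total += coeff * (p0 ** (-2 * n) - (1 - p0) ** (-2 * n))
    return total

p0, eps = 0.7, 0.3      # requires eps < 2 * (1 - p0) so that both KL_2 terms are defined
print(F_direct(p0, eps), F_series(p0, eps))
```

Both values are negative and coincide to machine precision for this choice of parameters.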

We are now in a position to prove the two main theorems:

Proof of Theorem 1.

(a) For each $\varepsilon > 0$, Lemma 6 implies the existence of a partition $\mathcal{A} = \{A, \Omega \setminus A\}$ such that
the distribution $P' = (p_0, p_1) = \pi_{\mathcal{A}}(P)$ satisfies $D^*(P',\varepsilon) = D^*(P,\varepsilon)$. In the imbalanced
case, there is no loss of generality in taking $p_0 > 1/2$. Then Lemma 7 shows that the
corresponding optimal binary $Q'$ is given by $Q' = (p_0 - \varepsilon/2, 1 - p_0 + \varepsilon/2)$. Define the
function

$$F(x,\delta) = \mathrm{KL}_2(x - \delta, x),$$

where $\delta = \varepsilon/2$. We claim that $F$ is increasing in $x$ on $[1/2 + \delta/2, 1]$ for every fixed $\delta$.
Indeed,

$$\frac{\partial F}{\partial x} = \frac{\delta + (1-x)x\log\left(1 - \frac{\delta}{x}\right) - (1-x)x\log\left(1 + \frac{\delta}{1-x}\right)}{(1-x)x}.$$

To show that $\frac{\partial}{\partial x}F \ge 0$ for $x \in [1/2 + \delta/2, 1]$, we note that $\frac{\partial}{\partial x}F(x,0) = 0$ and

$$\frac{\partial^2 F}{\partial\delta\,\partial x} = \frac{1}{1-x} - \frac{1}{1-x+\delta} + \frac{1}{x} - \frac{1}{x-\delta} = \frac{\delta(2x - 1 - \delta)}{x(1-x)(x-\delta)(1-x+\delta)},$$

which is non-negative whenever $x \ge 1/2 + \delta/2$. It follows that any $\pi_{\mathcal{A}}(P)$ that minimizes
$D^*(\pi_{\mathcal{A}}(P), \varepsilon)$ must select the minimum value of $p_0$ ($\ge 1/2$), namely the balance coefficient $\beta$.

(b) Only the second inequality requires proof. We may choose the (not necessarily optimal)
collapsed distribution $P' = (1/2, 1/2)$. Taking $Q' = (1/2 - \varepsilon/2, 1/2 + \varepsilon/2)$ and invoking
Lemma 5, we obtain

$$D^*(P,\varepsilon) \le D^*(P',\varepsilon) \le \mathrm{KL}_2(1/2 - \varepsilon/2, 1/2).$$

□
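Part (a) can be sanity-checked by brute force on a small discrete distribution. The sketch below relies on a consequence of Lemmas 3 and 4 that is worked out here rather than stated in the text: for the minimizer $Q_A$ one has $D(Q_A\|P) = \mathrm{KL}_2(P(A) + \varepsilon/2, P(A))$, so $D^*(P,\varepsilon)$ is the minimum of this quantity over admissible sets $A$. The example distribution and the name `d_star_bruteforce` are hypothetical.

```python
import math
from itertools import combinations

def kl2(p, q):
    """Binary KL divergence KL_2(p, q) in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def d_star_bruteforce(p, eps):
    """Minimum over proper subsets A of KL_2(P(A) + eps/2, P(A))."""
    best = float("inf")
    n = len(p)
    for r in range(1, n):
        for A in combinations(range(n), r):
            pA = sum(p[i] for i in A)
            # The boundary case P(A) + eps/2 = 1 is skipped to keep kl2 well defined.
            if pA + eps / 2 < 1:
                best = min(best, kl2(pA + eps / 2, pA))
    return best

p = [0.6, 0.25, 0.15]
beta = 0.6      # balance coefficient of p: the smallest subset mass that is >= 1/2
eps = 0.2       # satisfies eps < 4 * beta - 2 = 0.4

print(d_star_bruteforce(p, eps), kl2(beta - eps / 2, beta))
```

The two printed values agree, and both exceed the Pinsker lower bound $\varepsilon^2/2 = 0.02$.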



Proof of Theorem 2. If $P$ has full range, then for every binary distribution $P'$ there is a
partition $\mathcal{A}$ with $\pi_{\mathcal{A}}(P) = P'$. Since any feasible $(V,D) = (V(Q,P), D(Q\|P)) \in \mathbb{R}^2$ is
realized by some binary $P', Q'$ [13], it follows that

$$D^*(P,\varepsilon) = \inf_{Q_1,Q_2 \in \mathcal{P} : V(Q_1,Q_2) = \varepsilon} D(Q_1\|Q_2),$$

which is $L(\varepsilon)$ by definition.

□



4 Application: convergence of the empirical distribution 

In [4] we examined the convergence of the empirical distribution to the true one in the total
variation norm. More precisely, we considered a sequence of i.i.d. $\mathbb{N}$-valued random variables,
$X_1, X_2, \ldots$ distributed according to $P = (p_1, p_2, \ldots)$ and denoted

$$J_n = V(P, P_n), \qquad n \in \mathbb{N},$$

where $P_n$ is the empirical distribution induced by the first $n$ observations. The main
contribution of [4] was an analysis of the rate at which $\mathbb{E}J_n$ decays to zero; for example, it was
shown that for $P$ with finite support of size $k$,

$$\mathbb{E}J_n \le \sqrt{\frac{k}{n}}.$$

In greater generality, we showed that for all $n \ge 2$,

$$\mathbb{E}J_n \le \Lambda_n,$$

together with a lower bound of the form $c\,\Lambda_n - O(n^{-1/2})$ for an absolute constant $c > 0$, where
$\Lambda_n(P)$ is an explicit functional of $P$ built from sums over the indices $j$ with $p_j \ge 1/n$ and
with $p_j < 1/n$ (see [4] for the precise expression). Moreover, $\Lambda_n(P) \to 0$,
although the rate at which $\Lambda_n(P)$ decays to zero may be arbitrarily slow, depending on $P$.

Since the map $(X_1, \ldots, X_n) \mapsto J_n$ is $2/n$-Lipschitz continuous with respect to the Hamming
distance, McDiarmid's inequality [21] implies

$$\mathbb{P}(|J_n - \mathbb{E}J_n| > \varepsilon) \le 2\exp(-n\varepsilon^2/2) \tag{6}$$

for $n \in \mathbb{N}$, $\varepsilon > 0$. Being a rather general-purpose tool, McDiarmid's bound in many cases
does not yield optimal estimates. To examine the tightness of (6), we recall Sanov's Theorem
[6, 9], which yields

$$-\lim_{n \to \infty} \frac{1}{n}\log \mathbb{P}(J_n > \varepsilon) = D^*(P,\varepsilon).$$

Since for balanced distributions $D^*(P,\varepsilon) \le \varepsilon^2/2 + O(\varepsilon^4)$, our estimate in (6) actually has
the optimal constant 1/2 in the exponent. (See [H Theorem 1] for other instances where the
quantity $\varepsilon^2/2$ emerges in the exponent.)

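Combining the bound of Ordentlich and Weinberger (3) with Theorem 1, we get

$$\frac{1}{4(2\beta - 1)}\log\frac{\beta}{1-\beta}\,\varepsilon^2 \le D^*(P,\varepsilon) = \mathrm{KL}_2(\beta - \varepsilon/2, \beta).$$

As a consistency check, one may verify that

$$\frac{1}{4(2\beta - 1)}\log\frac{\beta}{1-\beta} \le \frac{1}{8\beta(1-\beta)}$$

for $1/2 < \beta < 1$.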
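This consistency check is simple to carry out numerically. The sketch below evaluates both coefficients on a grid of $\beta$ values; the grid and the function names `ow_coefficient` and `reverse_coefficient` are arbitrary choices for the illustration.

```python
import math

def ow_coefficient(beta):
    """Coefficient phi(P)/4 of eps^2 in the Ordentlich-Weinberger lower bound (3)."""
    return math.log(beta / (1 - beta)) / (4 * (2 * beta - 1))

def reverse_coefficient(beta):
    """Leading coefficient of eps^2 in KL_2(beta - eps/2, beta)."""
    return 1 / (8 * beta * (1 - beta))

grid = [0.5 + k / 1000 for k in range(1, 500)]     # beta in (1/2, 1)
gaps = [reverse_coefficient(b) - ow_coefficient(b) for b in grid]
print(min(gaps), max(gaps))
```

The gap stays positive on the whole grid and shrinks to zero as $\beta \to 1/2$, where both coefficients tend to 1/2.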